1  Introduction

Ranking is a central problem in information retrieval. Modern search engines, especially those designed for
the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted
query and the underlying documents in order to assess the relative relevance of a document to a given query
and thus rank the underlying collection. The sheer size of this problem has led to the development of
learning-to-rank (LTR) algorithms that automate the construction of such ranking functions: given a training
set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and
document features so as to effectively assess the relevance of any document to any query and
thus rank a collection in response to a user input.

Much thought and research has gone into the development of sophisticated learning-to-rank algorithms.
However, relatively little research has been conducted on the construction of appropriate learning-to-rank
data sets or on the effect of these data sets on the ability of a learning-to-rank algorithm to “learn”
effectively.

Given that IR technology is ubiquitous in a vast variety of contexts and environments, it is not
unreasonable to assume that searchable material (corpora) and user information needs will vary radically from
one retrieval environment to another. Theoretically, ranking functions should be trained over collections with
characteristics similar to those of the collections they will be deployed in. However, the ability to construct
different ranking functions for different retrieval environments is limited by the cost of constructing such
customized training collections. Thus, the question that naturally arises is whether training on a collection of
certain characteristics can still lead to an effective ranking function over collections of different
characteristics. To answer this question we trained our ranking functions (by employing SVM) over two different
collections, (a) the Million Query 2008 (MQ08) collection (GOV2 corpus and queries with at least one click on
documents in the .gov domain), and (b) a Bing generated collection (described in Section 2.1), and employed
the constructed ranking functions over the Million Query 2009 (MQ09) collection (ClueWeb09 corpus and
general web queries).

Furthermore,  even  within  a  certain  retrieval  environment  (represented  by  a  given  collection)  different 
queries  may  have  radically  different  characteristics  and  thus  different  features  may  better  capture  the  notion 
of  relevance.  For  instance,  in  the  case  of  precision-oriented  queries,  such  as  homepage/namepage  finding, 
the  url  of  a  document  or  its  popularity  may  be  more  indicative  of  the  document  relevance  than  the  document 
text  itself,  while  for  informational  queries  the  document  url  may  be  less  indicative  than  its  text.  Most  of
the  existing  learning-to-rank  approaches  train  a  single  ranking  function  to  handle  all  queries.  Hence,  the 
question  that  arises  is  whether  training  a  different  ranking  function  for  each  one  of  these  different  query 
categories  and  using  the  appropriate  ranking  function  when  a  user  submits  a  new  query  may  lead  to  better
retrieval  performance  than  employing  a  single  ranker  for  all  queries. 

Geng et al. [4] proposed query-dependent ranking. According to their approach, when a user submits a
query, the K Nearest Neighbor (KNN) method is employed to identify the queries closest to the submitted
one, and a ranker is trained over these K queries. This query-dependent ranker is then used to rank documents
with respect to the submitted query. The KNN query-dependent ranker model was then compared against
the single ranker model (baseline) and a query classification ranker model. In the latter model,
different rankers were constructed for homepage finding, namepage finding and topic distillation. The KNN
approach slightly outperformed the other two (with an nDCG difference of approximately 0.01-0.02).
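
The neighbor-selection step of such a query-dependent approach can be sketched as follows. This is only an illustration and not the implementation of Geng et al.: the query-level feature vectors, the Euclidean distance and the helper name nearest_training_queries are assumptions made for the example, and the ranker that would be trained over the selected queries is omitted.

```python
import numpy as np

def nearest_training_queries(query_vec, train_query_vecs, k):
    """Return indices of the k training queries closest to query_vec.

    Euclidean distance over query-level feature vectors is an assumption
    of this sketch; the text only states that KNN is used.
    """
    train = np.asarray(train_query_vecs, dtype=float)
    dists = np.linalg.norm(train - np.asarray(query_vec, dtype=float), axis=1)
    # A stable sort keeps the original order among equally distant queries.
    return np.argsort(dists, kind="stable")[:k].tolist()

# Hypothetical 2-dimensional query representations.
train_queries = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = nearest_training_queries([1.0, 1.0], train_queries, k=2)
# A query-dependent ranker would then be trained only on `neighbors`.
```

In this toy example the two selected training queries are the ones at indices 1 and 3.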

Web queries, however, may be classified based on a variety of different criteria, such as query hardness,
user intent, ambiguity, popularity (frequency in a query log), spatial or temporal characteristics, length,
click-through rates, and verbosity (i.e. a natural language query versus a set of keywords). In this work we
explore whether constructing different ranking functions for different query intent and query hardness leads
to better retrieval results than constructing a single ranker for all queries. However, given that we trained over
the MQ08 collection, our ability to answer this question depends heavily on the robustness of the
learning-to-rank algorithm when trained and tested over different collections (that is, it depends heavily on
the answer to the first question posed).

2  Methodology 

We  indexed  the  GOV2  collection  and  the  ClueWeb09  (category  B)  collection  using  the  Indri  Search  Engine 
from  the  Lemur  Toolkit  [1].  The  constructed  indexes  contained  five  fields:  document,  title,  heading,  anchor 
text,  and  url;  this  allowed  extracting  features  from  all  these  different  fields. 

Based on the Indri indexes, we extracted a total of 57 LETOR 4.0-like features for each one of the
query-document pairs. We did not utilize the available LETOR 4.0 MQ08 training set [2] so that the extracted
features from ClueWeb09 would be comparable to the ones from MQ08. A summary of the features can be
seen in Table 1. Text features were extracted for all 5 fields, so the 11 text features yield 55 values per
query-document pair, which together with the 2 web features give the total of 57.

The MQ08 data set (training data) consists of 784 queries from the Million Query 2008 collection. 403
queries had 8 documents labeled with relevance judgments, 204 queries had 16 documents labeled, 102
queries had 32 documents labeled, 50 queries had 64 documents labeled and 25 queries had 128 documents
labeled. The corpus of the MQ08 collection is the GOV2 corpus and the queries in the collection had at least
one click on documents in the .gov domain. Given the computational complexity of feature extraction and
training, features were extracted only from the union of (a) the top 1,000 documents ranked by the Indri
language model [1] over the document text, (b) the top 500 documents ranked by Indri over the anchor text,
and (c) the top 500 documents ranked by Indri over the url of the documents. A state-of-the-art SVM
learning-to-rank algorithm [5] was employed to construct the ranking functions.
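
The construction of the per-query candidate pool described above can be sketched as below. This is a minimal sketch under stated assumptions: the three input rankings are taken to be plain lists of document ids (best first) already produced by the retrieval engine, and the function name candidate_pool and the toy document ids are hypothetical.

```python
def candidate_pool(text_ranking, anchor_ranking, url_ranking,
                   n_text=1000, n_field=500):
    """Union of the top documents from three per-field rankings.

    Each ranking is a list of document ids, best first. Order of first
    appearance is preserved so the result is deterministic.
    """
    pool = []
    seen = set()
    for ranking, depth in ((text_ranking, n_text),
                           (anchor_ranking, n_field),
                           (url_ranking, n_field)):
        for doc_id in ranking[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool

# Toy example with tiny depths (document ids are hypothetical).
pool = candidate_pool(["d1", "d2", "d3"], ["d2", "d4"], ["d5", "d1"],
                      n_text=2, n_field=1)
```

With the default depths the pool holds at most 1,000 + 500 + 500 = 2,000 documents per query, and fewer whenever the three rankings overlap.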

2.1  Training  and  testing  over  collections  with  different  characteristics 

Our first goal was to explore whether training over a collection with characteristics different from those of
the collection the ranking function will be deployed in can still lead to effective retrieval.

We trained the SVM over the MQ08 data set and then used the resulting ranking function (NeuSvmBase)
to re-rank the 2,000 documents (described above) per query in the new MQ09 collection.

There are two striking differences between the MQ08 and the MQ09 collections: (1) the MQ08 collection
is on a .gov corpus with .gov related queries, while the MQ09 collection is a general Web collection,
and (2) the MQ08 is a spam-free collection while the MQ09 is not.


Text Features
  T1.  length
  T2.  TF
  T3.  IDF
  T4.  TF*IDF
  T5.  normalized TF
  T6.  Robertson’s TF
  T7.  Robertson’s IDF
  T8.  BM25
  T9.  Language Model (Laplace Smoothing)
  T10. Language Model (Dirichlet Smoothing)
  T11. Language Model (JM Smoothing)

Web Features
  W1.  number of incoming links
  W2.  number of incoming links from different domains

Table 1: Features extracted from the MQ08 (GOV2), the Bing generated and the MQ09 (ClueWeb09
category B) collections.


In order to be able to separate the conflated effects of these two different characteristics of the two
collections, we constructed a second ranker (NeuSvmStefan). We again used the SVM ranking algorithm but
instead of training over the MQ08 collection we first obtained queries from the query log of a commercial
search engine (different than the MQ09 queries), then we submitted these queries to Bing and finally
trained over the intersection of the top 100 documents returned by Bing and the documents included in the
ClueWeb09 category B crawl. Since there were no relevance judgments for the documents returned by Bing,
we used the reverse of the document ranks. This Bing-generated training collection has the same
characteristics as the MQ09 collection except that it is (effectively) spam-free.
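
The labeling step for the Bing-generated collection can be illustrated with the short sketch below. It is one possible reading of "the reverse of the document ranks", not a description of the exact procedure used: the label depth - rank + 1, the function name reverse_rank_labels and the toy urls are all assumptions of the example.

```python
def reverse_rank_labels(bing_ranking, clueweb_b_urls, depth=100):
    """Pseudo-relevance labels for Bing results that are also in the crawl.

    bing_ranking: list of urls, best first. clueweb_b_urls: set of urls in
    the crawl. The label `depth - rank + 1` (rank starting at 1) is one
    reading of "the reverse of the document ranks"; it is an assumption.
    """
    labels = {}
    for rank, url in enumerate(bing_ranking[:depth], start=1):
        if url in clueweb_b_urls:
            labels[url] = depth - rank + 1
    return labels

# Hypothetical urls: only two of the three results are in the crawl.
labels = reverse_rank_labels(["a.com", "b.com", "c.com"],
                             {"a.com", "c.com"}, depth=3)
```

Documents that Bing returns but that are missing from the crawl simply receive no label and are left out of the training data.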

2.2  Query-dependent  rankings 

The second question we attempted to answer was whether constructing separate ranking functions for
different query categories can lead to more effective retrieval than constructing a single ranker. The query
characteristics we used to categorize queries were query intent (i.e. precision-oriented queries vs.
recall-oriented queries) and query hardness (i.e. hard vs. easy queries).

The training data set was the MQ08 data set. Regarding the query intent, we manually classified the
784 queries in the MQ08 collection. The Average Average Precision (AAP), based on the Average Precision
scores of the participating runs in the MQ08 track, was utilized to classify queries into hard and easy.
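
The hard/easy split by AAP can be sketched as follows. This is a minimal sketch: AAP is computed as the mean of the AP scores of the participating runs for a query, as described above, while the threshold value, the function names and the toy AP scores are hypothetical, since the cut-off used is not reported in the text.

```python
def average_average_precision(ap_by_run):
    """Mean of the AP scores that the participating runs got on one query."""
    return sum(ap_by_run) / len(ap_by_run)

def split_hard_easy(ap_scores, threshold):
    """Label each query 'hard' or 'easy' by thresholding its AAP.

    ap_scores maps query id -> list of AP scores, one per run. The
    threshold is hypothetical; the text does not report the cut-off.
    """
    return {qid: ("hard" if average_average_precision(aps) < threshold
                  else "easy")
            for qid, aps in ap_scores.items()}

# Hypothetical AP scores of three runs on two queries.
classes = split_hard_easy({"q1": [0.10, 0.20, 0.15],
                           "q2": [0.50, 0.70, 0.60]},
                          threshold=0.3)
```

Here q1 has an AAP of 0.15 and falls below the illustrative threshold, while q2 has an AAP of 0.60 and is labeled easy.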

Then, we trained three sets of ranking functions. The first set consisted of two rankers, one over the
precision-oriented queries and one over the recall-oriented ones (NeuSvmPR). The second set consisted
again of two rankers, one over the hard and one over the easy queries (NeuSvmHE), while the last set
consisted of four rankers, a ranker for each one of the four combinations, i.e. precision-oriented and hard,
recall-oriented and hard, etc. (NeuSvmPRHE).


        Fold1   Fold2   Fold3   Fold4   Fold5
MAP     0.283   0.294   0.325   0.300   0.278

Table 2: 5-fold cross validation on all judged documents of MQ09 using regression


For  MQ09  queries,  in  order  to  determine  what  ranker  to  use,  we  had  to  estimate  the  intent  (precision  vs. 
recall)  and  the  difficulty  (hard  vs.  easy): 

• Precision vs. Recall: We constructed an SVM classifier and trained it over the MQ08 queries, using as
features the average feature values of the relevant documents for each query together with the manual labels.
We applied the classifier over the MQ09 queries using as features the average feature values of the top 100
documents (pseudo-relevant).

• Hard vs. Easy: We used the Jensen-Shannon distance of the document rankings generated by a number
of ranking functions (e.g. BM25, Language Models etc.) on the MQ09 collection [3].
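
As an illustration of the second estimate, the sketch below computes a generalized Jensen-Shannon divergence among several rankings of the same query. It is a sketch under stated assumptions rather than the procedure of [3]: each ranking is turned into a distribution over documents with weights proportional to 1/rank, which is an assumption of the example, and whether a square root or a particular threshold is applied afterwards is not stated in the text.

```python
import math

def rank_distribution(ranking, universe):
    """Turn a ranking into a probability distribution over `universe`.

    Weights proportional to 1/rank are an assumption of this sketch;
    documents missing from the ranking get probability 0.
    """
    weights = {doc: 1.0 / r for r, doc in enumerate(ranking, start=1)}
    total = sum(weights.values())
    return [weights.get(doc, 0.0) / total for doc in universe]

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(distributions):
    """Generalized Jensen-Shannon divergence with uniform weights:
    entropy of the mean distribution minus the mean entropy."""
    n = len(distributions)
    mean = [sum(col) / n for col in zip(*distributions)]
    return entropy(mean) - sum(entropy(p) for p in distributions) / n

docs = ["d1", "d2", "d3"]
same = js_divergence([rank_distribution(["d1", "d2", "d3"], docs)] * 2)
diff = js_divergence([rank_distribution(["d1", "d2", "d3"], docs),
                      rank_distribution(["d3", "d2", "d1"], docs)])
```

Identical rankings give a divergence of zero, while the two reversed rankings in the toy example give a strictly positive value.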

In both classifications a hard decision of whether a new query is precision-oriented or recall-oriented
or whether it is hard or easy was made. Then the appropriate ranker was used to re-rank the top 2,000
documents originally ranked by the Indri language model as described above.
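
The routing of a query to a category-specific ranker after the hard decision can be sketched as follows. This is a minimal sketch assuming linear scoring functions; the weight vectors, feature values and function names are hypothetical and only stand in for the trained rankers.

```python
def select_ranker(rankers, intent, hardness):
    """Pick the ranker trained for the predicted (intent, hardness) pair."""
    return rankers[(intent, hardness)]

def rerank(doc_features, weights):
    """Sort documents by a linear score w . x, best first.

    A linear scoring function is an assumption of this sketch.
    """
    def score(doc_id):
        return sum(w * x for w, x in zip(weights, doc_features[doc_id]))
    return sorted(doc_features, key=score, reverse=True)

# Hypothetical weight vectors standing in for four category-specific rankers.
rankers = {
    ("precision", "hard"): [1.0, 0.0],
    ("precision", "easy"): [0.8, 0.2],
    ("recall", "hard"): [0.2, 0.8],
    ("recall", "easy"): [0.0, 1.0],
}
features = {"d1": [0.9, 0.1], "d2": [0.2, 0.7]}
ranked = rerank(features, select_ranker(rankers, "recall", "easy"))
```

Because the decision is hard, a misclassified query is scored entirely by a ranker trained for a different category, so the quality of the two classifiers directly bounds what the category-specific rankers can achieve.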

3  Results  and  Discussion 

The baseline run of the single ranker trained over the Million Query 2008 collection (NeuSvmBase) resulted
in an estimated Mean Average Precision (MAP) of 0.089, a particularly low score. To test the correctness of
our SVM ranking algorithm and the correctness of the feature extractor we extracted features from the
Million Query 2008 collection, and performed a five-fold cross validation. The retrieval performance achieved
was at least as good as the LETOR 4.0 baselines. Thus, these results indicate that training over a collection
of given characteristics cannot always lead to an effective ranking function when the function is deployed to
rank documents in a collection of radically different characteristics.

Our second run of the single ranker trained over the Bing-generated collection (NeuSvmStefan)
performed slightly worse than the baseline, achieving an MAP score of 0.084. As mentioned earlier, the
Bing-generated training collection has similar characteristics to the Million Query 2009 collection except that
it is spam-free. Therefore, a reasonable assumption is that training over a spam-free collection and using
the ranker over a collection that includes spam leads to low retrieval effectiveness. We manually inspected
the top documents of our baseline ranker and observed that although the original ranking by the language
model included many relevant documents at the top positions, the learning-to-rank algorithm boosted spam
documents towards the top positions of the re-ranked list. An interesting question that arises here is how
much the effectiveness of our rankings could improve by simply removing all the returned spam documents.
Further, a future direction of research would be to compare the retrieval effectiveness of rankers trained over
collections that include spam and rankers trained over spam-free collections, with anti-spamming applied
over the results of both rankers.

After obtaining the query categories and judgments from TREC MQ09, we trained and tested our ranking
function over the MQ09 data by performing a 5-fold cross validation. When testing our ranking function
we only considered the judged documents, in a LETOR-like manner, and thus the performance of our ranker
is not comparable to the ones reported by TREC for our submitted rankers. However, the results for each
fold can be viewed in Table 2 and the mean average precision achieved was 0.296.
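
The reported mean can be checked directly from the per-fold values of Table 2; the variable names below are only illustrative.

```python
fold_map = [0.283, 0.294, 0.325, 0.300, 0.278]  # per-fold MAP from Table 2
mean_map = sum(fold_map) / len(fold_map)
print(f"{mean_map:.3f}")  # prints 0.296
```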


Train        Test         MAP
Recall       Precision    0.37
Precision    Precision    0.38
Precision    Recall       0.49
Recall       Recall       0.52

Table 3: Training & Testing on different query categories over all judged documents of MQ09 using
regression


                   Easy   Medium   Hard
predicted Hard      115       91     63
predicted Easy      107      103    121

Table 4: Hard vs. Easy Queries

                        Precision   Recall
predicted Precision            81      108
predicted Recall              176      235

Table 5: Precision vs Recall Queries


Given the failure of the baseline learning-to-rank algorithm to learn an effective ranking function for the
Million Query 2009 collection, we could not answer our second question of whether query-dependent
rankings over predefined query categories can lead to significant improvements when compared with a
single ranker for all queries. The estimated MAP scores were all slightly lower than the baseline but any
conclusions would be misleading.

However, we conducted a similar experiment over the MQ09 collection after the qrels and the query
categories were released by the Track. We built training and testing data sets based on the different user
intent (i.e. one precision-oriented query data set and one recall-oriented query data set). There were
a total of 176 precision-oriented and 230 recall-oriented queries labeled by NIST assessors. We trained
two different ranking functions over the two training sets and tested the ranking functions against both the
precision-oriented and the recall-oriented testing sets. The results (shown in Table 3) illustrate that a ranking
function trained over precision-oriented queries outperforms a ranking function trained over recall-oriented
queries when the testing set consists of precision-oriented queries only, while the opposite is true in the case
of a recall-oriented query test set, as expected. A further observation is that recall-oriented queries appear
to be more useful for training than precision-oriented ones, since the performance difference between the
two rankers is large when the test set consists of recall-oriented queries but very small when it consists of
precision-oriented ones. This, however, needs further investigation, since the set of recall-oriented queries
was much larger than the set of precision-oriented queries, which could also have affected the effectiveness
of the trained ranking functions.

Finally, the results of our classification can be viewed in Tables 4 and 5. The query intent classifier
seemed to be biased towards recall-oriented queries, classifying 411 out of 600 queries as recall-oriented. Out
of those 411 predicted recall queries, 235 were assessed as recall queries by the judges. The Jensen-Shannon
methodology to classify queries based on their hardness does not seem to have done well either. Given that
Jensen-Shannon has been shown to predict query hardness well when it is applied over rankings by TREC
runs, this may indicate that applying the same methodology over rankings by basic ranking functions (e.g.
BM25, LM) does not lead to equally good predictions.
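
The figures quoted for the intent classifier can be re-derived from the counts of Table 5, as in the short sketch below. The layout of the dictionary (rows as predictions, columns as assessed intent) follows the table, while the variable names and the two derived ratios are additions made for illustration.

```python
# Counts from Table 5: rows are predictions, columns are assessed intent.
table5 = {
    "precision": {"precision": 81, "recall": 108},
    "recall": {"precision": 176, "recall": 235},
}
total = sum(sum(row.values()) for row in table5.values())
predicted_recall = sum(table5["recall"].values())
# Fraction of queries whose predicted intent matches the assessed intent.
agreement = (table5["precision"]["precision"]
             + table5["recall"]["recall"]) / total
# Fraction of predicted-recall queries that were assessed as recall.
recall_hit_rate = table5["recall"]["recall"] / predicted_recall
```

The totals reproduce the 600 queries and the 411 predicted recall queries mentioned above; the overall agreement comes to roughly 0.53 and the share of predicted recall queries that were assessed as recall to roughly 0.57.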